{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 感知机模型\n",
    "### 感知机定义\n",
    "感知机是由输入空间到输出空间的如下函数$$f(x)=sign(w*x+b)$$ 其中$w$和$b$称为感知机模型参数，$w\\in\\mathbb{R^n}$叫做权值向量，$b\\in\\mathbb{R}$叫做偏置(bias)，$w*x$是内积，$sign$是符号函数，即\n",
    "$$sign(x)=\\begin{cases}\n",
    "+1 & x>=0\\\\\n",
    "-1 & x<0\\end{cases}$$\n",
    "感知机是线性分类模型，属于判别类型。  \n",
    "几何解释如下：线性方程$$w*x+b=0$$对应特征$R^n$的一个超平面$S$，其中$w$是法向量，$b$是截距。超平面把空间分为两个部分。  \n",
    "如果存在超平面$S$把数据集正负实例点完全正确地划分到超平面的两侧，即对所有的$y_i=+1$，$w*x+b>0$；对所有的$y_i=-1$，$w*x+b<0$，则称数据集为线性可分数据集。\n",
    "### 感知机策略\n",
    "假设数据集线性可分，为确定把正负实例点完全正确分开的超平面，需要确定参数$w,b$，需要定义损失函数并极小化。  \n",
    "定义损失函数为误分类点到超平面的总距离  \n",
    "首先，任意一点到超平面的距离为$$\\frac{1}{||w||_2}|w*x_0+b|$$\n",
    "其次，对于误分类的点，$$-y_i*(w*x_i+b)>0$$正确分类的点有$w*x_i+b=0$  \n",
    "定义损失函数为$$L(w,b)=-\\sum_{x_i \\in M}y_i*(w*x_i+b)$$\n",
    "最小化损失函数即可。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "L0：非零元素个数，$||x||_0$  \n",
    "L1范数：非零元素绝对值的和$||x||_1$   \n",
    "L2范数：平方和再开方$||x||_2$  \n",
    "平面方程：$f(x_1,x_2,x_3,...,x_n)=0$    \n",
    "点$(x_0,y_0,z_0)$到平面$A*x+B*y+C*z+D=0$的距离：$\\frac{|A*x_0+B*y_0+C*z_0+D|}{\\sqrt{A^2+B^2+C^2}}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "start reading data...\n",
      "reading data costs 3.656639 seconds\n",
      "start training\n",
      "tarining costs 1.593923 seconds\n",
      "start predicting\n",
      "predicting costs 3.937954 seconds\n",
      "the accuracy score is 0.983838\n"
     ]
    }
   ],
   "source": [
    "#encoding=utf-8\n",
    "import pandas as pd\n",
    "import random\n",
    "import time\n",
    "from sklearn.model_selection import train_test_split #sklearn.cross_validation不可用\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "class Perceptron(object):\n",
    "    \n",
    "    def __init__(self):\n",
    "        self.learning_step=0.001\n",
    "        self.max_iteration=5000\n",
    "        \n",
    "    def train(self, features, labels):\n",
    "        self.w=[0.0]*(len(features[0])+1)#初始化w,b为0，b在最后一位\n",
    "        correct_count=0\n",
    "        while correct_count<self.max_iteration:\n",
    "            index=random.randint(0,len(labels)-1)\n",
    "            x=list(features[index])\n",
    "            x.append(1.0)#方便与b相乘\n",
    "            y=2*labels[index]-1#label为1，化为1；label为0，化为-1\n",
    "            wx=sum([self.w[j]*x[j] for j in range(len(self.w))])\n",
    "            if wx*y>0:#分类正确\n",
    "                correct_count+=1\n",
    "                continue\n",
    "            for i in range(len(self.w)):#分类错误，更新w\n",
    "                self.w[i]+=self.learning_step*(y*x[i])#步长*w[i]的梯度\n",
    "                \n",
    "    def _predict(self, x):\n",
    "        wx=sum([self.w[j]*x[j] for j in range(len(self.w))])\n",
    "        return int(wx>0)\n",
    "    \n",
    "    def predict(self, features):\n",
    "        labels=[]\n",
    "        for feature in features:\n",
    "            x=list(feature)\n",
    "            x.append(1)\n",
    "            labels.append(self._predict(x))\n",
    "        return labels\n",
    "\n",
    "print(\"start reading data...\")\n",
    "time_1=time.time()\n",
    "raw_data=pd.read_csv('data/train_binary.csv', header=0)#header=0，第0行为列索引\n",
    "data=raw_data.values#从dataFrame转化为数组\n",
    "features=data[::,1::]#取每一行；从index=1的列开始，取每一列\n",
    "labels=data[::,0]#取每一行；只要第index=0的列\n",
    "train_features,test_features,train_labels,test_labels=train_test_split(features,\n",
    "                                                                       labels, \n",
    "                                                                       test_size=0.33, #测试集大小\n",
    "                                                                       random_state=0)#定下random_state值，那么每次划分结果一样\n",
    "time_2=time.time()\n",
    "print(\"reading data costs %f seconds\"%(time_2-time_1))\n",
    "\n",
    "print(\"start training\")\n",
    "p=Perceptron()\n",
    "p.train(train_features, train_labels)\n",
    "\n",
    "time_3=time.time()\n",
    "print('tarining costs %f seconds'%(time_3-time_2))\n",
    "\n",
    "print('start predicting')\n",
    "test_predict=p.predict(test_features)\n",
    "time_4=time.time()\n",
    "print('predicting costs %f seconds'%(time_4-time_3))\n",
    "\n",
    "score=accuracy_score(test_labels, test_predict)\n",
    "print('the accuracy score is %f'%score)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   label  pixel0  pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  \\\n",
      "0      1       0       0       0       0       0       0       0       0   \n",
      "1      0       0       0       0       0       0       0       0       0   \n",
      "2      1       0       0       0       0       0       0       0       0   \n",
      "3      4       0       0       0       0       0       0       0       0   \n",
      "4      0       0       0       0       0       0       0       0       0   \n",
      "\n",
      "   pixel8    ...     pixel774  pixel775  pixel776  pixel777  pixel778  \\\n",
      "0       0    ...            0         0         0         0         0   \n",
      "1       0    ...            0         0         0         0         0   \n",
      "2       0    ...            0         0         0         0         0   \n",
      "3       0    ...            0         0         0         0         0   \n",
      "4       0    ...            0         0         0         0         0   \n",
      "\n",
      "   pixel779  pixel780  pixel781  pixel782  pixel783  \n",
      "0         0         0         0         0         0  \n",
      "1         0         0         0         0         0  \n",
      "2         0         0         0         0         0  \n",
      "3         0         0         0         0         0  \n",
      "4         0         0         0         0         0  \n",
      "\n",
      "[5 rows x 785 columns]\n",
      "   label  pixel0  pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  \\\n",
      "0      1       0       0       0       0       0       0       0       0   \n",
      "1      0       0       0       0       0       0       0       0       0   \n",
      "2      1       0       0       0       0       0       0       0       0   \n",
      "3      4       0       0       0       0       0       0       0       0   \n",
      "4      0       0       0       0       0       0       0       0       0   \n",
      "\n",
      "   pixel8    ...     pixel774  pixel775  pixel776  pixel777  pixel778  \\\n",
      "0       0    ...            0         0         0         0         0   \n",
      "1       0    ...            0         0         0         0         0   \n",
      "2       0    ...            0         0         0         0         0   \n",
      "3       0    ...            0         0         0         0         0   \n",
      "4       0    ...            0         0         0         0         0   \n",
      "\n",
      "   pixel779  pixel780  pixel781  pixel782  pixel783  \n",
      "0         0         0         0         0         0  \n",
      "1         0         0         0         0         0  \n",
      "2         0         0         0         0         0  \n",
      "3         0         0         0         0         0  \n",
      "4         0         0         0         0         0  \n",
      "\n",
      "[5 rows x 785 columns]\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "data=pd.read_csv('data/02train.csv',header=0)#header=0表示第0行为列索引\n",
    "print(data.head(5))\n",
    "#解释s[i:j:k]是，根据该“片第从i到j与第k步”。何时i和j缺席，整个序列是和s[::k]意思是“每k个项目”"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
